In this notebook, I put the recommendation skills I have acquired so far to use on real data from the IBM Watson Studio platform.
The table of contents below lists the recommendation methods covered, each suited to a different situation.
I. Exploratory Data Analysis
II. Rank Based Recommendations
III. User-User Based Collaborative Filtering
IV. Matrix Factorization
V. Extras & Concluding
Let's start by importing the necessary libraries and reading in the data.
# Default libraries
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import project_tests as t
import pickle
# Other imported libraries
import plotly.express as px
import plotly.graph_objects as go
from plotly.subplots import make_subplots
import plotly.figure_factory as ff
# Importing metric scores
from sklearn.metrics import accuracy_score, f1_score, precision_score, recall_score
%matplotlib inline
df = pd.read_csv('data/user-item-interactions.csv')
df_content = pd.read_csv('data/articles_community.csv')
del df['Unnamed: 0']
del df_content['Unnamed: 0']
# Show df to get an idea of the data
df.head()
| | article_id | title | email |
|---|---|---|---|
| 0 | 1430.0 | using pixiedust for fast, flexible, and easier... | ef5f11f77ba020cd36e1105a00ab868bbdbf7fe7 |
| 1 | 1314.0 | healthcare python streaming application demo | 083cbdfa93c8444beaa4c5f5e0f5f9198e4f9e0b |
| 2 | 1429.0 | use deep learning for image classification | b96a4f2e92d8572034b1e9b28f9ac673765cd074 |
| 3 | 1338.0 | ml optimization using cognitive assistant | 06485706b34a5c9bf2a0ecdac41daf7e7654ceb7 |
| 4 | 1276.0 | deploy your python model as a restful api | f01220c46fc92c6e6b161b1849de11faacd7ccb2 |
# Show df_content to get an idea of the data
df_content.head()
| | doc_body | doc_description | doc_full_name | doc_status | article_id |
|---|---|---|---|---|---|
| 0 | Skip navigation Sign in SearchLoading...\r\n\r... | Detect bad readings in real time using Python ... | Detect Malfunctioning IoT Sensors with Streami... | Live | 0 |
| 1 | No Free Hunch Navigation * kaggle.com\r\n\r\n ... | See the forest, see the trees. Here lies the c... | Communicating data science: A guide to present... | Live | 1 |
| 2 | ☰ * Login\r\n * Sign Up\r\n\r\n * Learning Pat... | Here’s this week’s news in Data Science and Bi... | This Week in Data Science (April 18, 2017) | Live | 2 |
| 3 | DATALAYER: HIGH THROUGHPUT, LOW LATENCY AT SCA... | Learn how distributed DBs solve the problem of... | DataLayer Conference: Boost the performance of... | Live | 3 |
| 4 | Skip navigation Sign in SearchLoading...\r\n\r... | This video demonstrates the power of IBM DataS... | Analyze NY Restaurant data using Spark in DSX | Live | 4 |
The goal is to recommend articles to real users, by which I mean users with an email address. In our dataset, these are the rows where the email column is not missing. Since we only want to recommend articles to real users, we may want to remove the rows of df that have a missing value in the email column.
There are $17$ missing values in the email column of the dataset df.
df.isnull().sum()['email']
17
All $17$ rows with a missing value in the email column are shown in the cell below.
nan_email_df = df[df.email.isnull()]
nan_email_df
| | article_id | title | email |
|---|---|---|---|
| 25131 | 1016.0 | why you should master r (even if it might even... | NaN |
| 29758 | 1393.0 | the nurse assignment problem | NaN |
| 29759 | 20.0 | working interactively with rstudio and noteboo... | NaN |
| 29760 | 1174.0 | breast cancer wisconsin (diagnostic) data set | NaN |
| 29761 | 62.0 | data visualization: the importance of excludin... | NaN |
| 35264 | 224.0 | using apply, sapply, lapply in r | NaN |
| 35276 | 961.0 | beyond parallelize and collect | NaN |
| 35277 | 268.0 | sector correlations shiny app | NaN |
| 35278 | 268.0 | sector correlations shiny app | NaN |
| 35279 | 268.0 | sector correlations shiny app | NaN |
| 35280 | 268.0 | sector correlations shiny app | NaN |
| 35281 | 415.0 | using machine learning to predict value of hom... | NaN |
| 35282 | 846.0 | pearson correlation aggregation on sparksql | NaN |
| 35283 | 268.0 | sector correlations shiny app | NaN |
| 35284 | 162.0 | an introduction to stock market data analysis ... | NaN |
| 42749 | 647.0 | getting started with apache mahout | NaN |
| 42750 | 965.0 | data visualization playbook: revisiting the ba... | NaN |
nan_email_df.groupby('article_id').count().sort_values(by=['title'], ascending=False)
| article_id | title | email |
|---|---|---|
| 268.0 | 5 | 0 |
| 20.0 | 1 | 0 |
| 62.0 | 1 | 0 |
| 162.0 | 1 | 0 |
| 224.0 | 1 | 0 |
| 415.0 | 1 | 0 |
| 647.0 | 1 | 0 |
| 846.0 | 1 | 0 |
| 961.0 | 1 | 0 |
| 965.0 | 1 | 0 |
| 1016.0 | 1 | 0 |
| 1174.0 | 1 | 0 |
| 1393.0 | 1 | 0 |
Among the $13$ articles with ids [268.0, 20.0, 62.0, 162.0, 224.0, 415.0, 647.0, 846.0, 961.0, 965.0, 1016.0, 1174.0, 1393.0] that users with missing emails interacted with, article 268.0 was interacted with most often ($5$ times).
I will remove all rows with missing data from the dataframe df.
df_data = df.dropna()
df_data.head()
| | article_id | title | email |
|---|---|---|---|
| 0 | 1430.0 | using pixiedust for fast, flexible, and easier... | ef5f11f77ba020cd36e1105a00ab868bbdbf7fe7 |
| 1 | 1314.0 | healthcare python streaming application demo | 083cbdfa93c8444beaa4c5f5e0f5f9198e4f9e0b |
| 2 | 1429.0 | use deep learning for image classification | b96a4f2e92d8572034b1e9b28f9ac673765cd074 |
| 3 | 1338.0 | ml optimization using cognitive assistant | 06485706b34a5c9bf2a0ecdac41daf7e7654ceb7 |
| 4 | 1276.0 | deploy your python model as a restful api | f01220c46fc92c6e6b161b1849de11faacd7ccb2 |
In the cells below, I will provide some insight into the descriptive statistics of the data.
1. How is the number of article interactions per user distributed in the dataset? The cells below provide a histogram and descriptive statistics of the number of interactions each user has.
# Each email identifies a user and each article_id identifies an article.
user_article_df = df_data.groupby('email')['article_id'].count().reset_index()
user_article_df.head()
| | email | article_id |
|---|---|---|
| 0 | 0000b6387a0366322d7fbfc6434af145adf7fed1 | 13 |
| 1 | 001055fc0bb67f71e8fa17002342b256a30254cd | 4 |
| 2 | 00148e4911c7e04eeff8def7bbbdaf1c59c2c621 | 3 |
| 3 | 001a852ecbd6cc12ab77a785efa137b2646505fe | 6 |
| 4 | 001fc95b90da5c3cb12c501d201a915e4f093290 | 2 |
hist_fig = px.histogram(user_article_df['article_id'],
                        nbins=50,
                        title="Distribution of the number of article interactions per user",
                        labels={'value': "Number of articles per user"},
                        template="plotly_dark"
                        )
hist_fig.update_layout(title_x=0.5,
                       yaxis_title="Number of users",
                       showlegend=False
                       )
hist_fig.show(config={'displaylogo': False})
# 50% of users interact with 3 articles or fewer.
median_val = user_article_df.describe().loc['50%']['article_id']
median_val
3.0
# The maximum number of user-article interactions by any 1 user is 364.
max_views_by_user = user_article_df.describe().loc['max']['article_id']
max_views_by_user
364.0
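To make the two statistics above concrete, here is a minimal sketch on a made-up interaction log. The dataframe `toy` and its values are hypothetical and only illustrate the groupby-then-summarise pattern; note that repeat views of the same article count as separate interactions.

```python
import pandas as pd

# Hypothetical toy interaction log: one row per (user, article) interaction.
toy = pd.DataFrame({
    "email": ["a", "a", "a", "b", "c", "c"],
    "article_id": [1.0, 2.0, 2.0, 1.0, 3.0, 1.0],
})

# Count interactions per user (repeat views of an article are counted too).
per_user = toy.groupby("email")["article_id"].count()

# Median and maximum of the per-user interaction counts.
median_interactions = per_user.median()
max_interactions = per_user.max()
print(per_user.to_dict(), median_interactions, max_interactions)
```

Calling `per_user.median()` and `per_user.max()` directly gives the same numbers as reading the `50%` and `max` rows of `describe()`.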
2. Explore and remove duplicate articles from the df_content dataframe.
# Find and explore duplicate articles
# Get the article_id's that are duplicated.
duplicate_ids = df_content[df_content.article_id.duplicated()].article_id.unique()
# Dataframe of all duplicated articles in df_content
df_content[df_content.article_id.isin(duplicate_ids)].sort_values(by=['article_id'])
| | doc_body | doc_description | doc_full_name | doc_status | article_id |
|---|---|---|---|---|---|
| 50 | Follow Sign in / Sign up Home About Insight Da... | Community Detection at Scale | Graph-based machine learning | Live | 50 |
| 365 | Follow Sign in / Sign up Home About Insight Da... | During the seven-week Insight Data Engineering... | Graph-based machine learning | Live | 50 |
| 221 | * United States\r\n\r\nIBM® * Site map\r\n\r\n... | When used to make sense of huge amounts of con... | How smart catalogs can turn the big data flood... | Live | 221 |
| 692 | Homepage Follow Sign in / Sign up Homepage * H... | One of the earliest documented catalogs was co... | How smart catalogs can turn the big data flood... | Live | 221 |
| 232 | Homepage Follow Sign in Get started Homepage *... | If you are like most data scientists, you are ... | Self-service data preparation with IBM Data Re... | Live | 232 |
| 971 | Homepage Follow Sign in Get started * Home\r\n... | If you are like most data scientists, you are ... | Self-service data preparation with IBM Data Re... | Live | 232 |
| 399 | Homepage Follow Sign in Get started * Home\r\n... | Today’s world of data science leverages data f... | Using Apache Spark as a parallel processing fr... | Live | 398 |
| 761 | Homepage Follow Sign in Get started Homepage *... | Today’s world of data science leverages data f... | Using Apache Spark as a parallel processing fr... | Live | 398 |
| 578 | This video shows you how to construct queries ... | This video shows you how to construct queries ... | Use the Primary Index | Live | 577 |
| 970 | This video shows you how to construct queries ... | This video shows you how to construct queries ... | Use the Primary Index | Live | 577 |
# Remove any rows that have the same article_id - only keep the first
df_content.drop_duplicates(subset=["article_id"], keep="first", inplace=True)
df_content.head()
| | doc_body | doc_description | doc_full_name | doc_status | article_id |
|---|---|---|---|---|---|
| 0 | Skip navigation Sign in SearchLoading...\r\n\r... | Detect bad readings in real time using Python ... | Detect Malfunctioning IoT Sensors with Streami... | Live | 0 |
| 1 | No Free Hunch Navigation * kaggle.com\r\n\r\n ... | See the forest, see the trees. Here lies the c... | Communicating data science: A guide to present... | Live | 1 |
| 2 | ☰ * Login\r\n * Sign Up\r\n\r\n * Learning Pat... | Here’s this week’s news in Data Science and Bi... | This Week in Data Science (April 18, 2017) | Live | 2 |
| 3 | DATALAYER: HIGH THROUGHPUT, LOW LATENCY AT SCA... | Learn how distributed DBs solve the problem of... | DataLayer Conference: Boost the performance of... | Live | 3 |
| 4 | Skip navigation Sign in SearchLoading...\r\n\r... | This video demonstrates the power of IBM DataS... | Analyze NY Restaurant data using Spark in DSX | Live | 4 |
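The find-then-drop pattern used above can be sketched on a tiny, made-up content table. The dataframe `toy_content` and its values are hypothetical; the point is only that `keep="first"` retains the first row seen for each `article_id`, even when the other columns differ.

```python
import pandas as pd

# Hypothetical content table in which article 50 appears twice
# with different descriptions.
toy_content = pd.DataFrame({
    "doc_full_name": ["Graph ML", "Smart catalogs", "Graph ML"],
    "doc_description": ["short blurb", "catalog blurb", "long blurb"],
    "article_id": [50, 221, 50],
})

# keep=False flags every row whose article_id occurs more than once.
dup_mask = toy_content["article_id"].duplicated(keep=False)
duplicated_rows = toy_content[dup_mask]

# Keep only the first row for each article_id.
deduped = toy_content.drop_duplicates(subset=["article_id"], keep="first")
print(duplicated_rows)
print(deduped)
```

Because only `article_id` is compared, whichever description happens to come first is the one that survives.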
3. Find the following:
a. The number of unique articles that have an interaction with a user.
# The number of unique articles that have at least one interaction
unique_articles = df.article_id.nunique()
unique_articles
714
b. The number of unique articles in the dataset (whether they have any interactions or not).
# The number of unique articles on the IBM platform
total_articles = df_content.article_id.nunique()
total_articles
1051
c. The number of unique users in the dataset. (excluding null values)
# The number of unique users
unique_users = df_data.email.nunique()
unique_users
5148
d. The number of user-article interactions in the dataset.
# The number of user-article interactions. Note that this includes unknown users, i.e.,
# missing data in the email column.
user_article_interactions = df.shape[0]
user_article_interactions
45993
4. Let us find the most viewed article_id, as well as how often it was viewed.
# The most viewed article in the dataset was viewed how many times?
most_viewed_article_df = df.groupby("article_id")['email'].count().sort_values(ascending=False).reset_index()[0:1]
most_viewed_article_df
| | article_id | email |
|---|---|---|
| 0 | 1429.0 | 937 |
# The id of the most viewed article in the dataset
most_viewed_article_id = most_viewed_article_df.loc[0]['article_id']
# Get the title of the most viewed article
most_viewed_article_title = df[df.article_id == most_viewed_article_id].drop_duplicates(subset=['article_id'])['title'].values[0].capitalize()
# The number of times the most viewed article was viewed
max_views = most_viewed_article_df.loc[0]['email']
print("The most viewed article is '{}', with article id {}. It had {} number of reads.".format(most_viewed_article_title,
                                                                                             most_viewed_article_id,
                                                                                             max_views
                                                                                             ))
The most viewed article is 'Use deep learning for image classification', with article id 1429.0. It had 937.0 number of reads.
After talking to the company leaders, the email_mapper function was deemed a reasonable way to map users to ids. There were a small number of null values, and it was found that all of these null values likely belonged to a single user (which is how they are stored using the function below).
def email_mapper():
    """
    INPUT:
    - None
    OUTPUT:
    - email_encoded - A Python list - A list of integers which are the user ids
    """
    coded_dict = dict()
    counter = 1
    email_encoded = []
    # For each email or user in df['email']
    for val in df['email']:
        # If the email or user is not in coded_dict
        if val not in coded_dict:
            # Update coded_dict by setting the key as val,
            # i.e., the user, and its value as counter
            coded_dict[val] = counter
            # Increment counter by 1
            counter += 1
        # Add the id of the user to the list email_encoded
        email_encoded.append(coded_dict[val])
    # Return email_encoded
    return email_encoded
email_encoded = email_mapper()
del df['email']
df['user_id'] = email_encoded
# show header
df.head()
| article_id | title | user_id | |
|---|---|---|---|
| 0 | 1430.0 | using pixiedust for fast, flexible, and easier... | 1 |
| 1 | 1314.0 | healthcare python streaming application demo | 2 |
| 2 | 1429.0 | use deep learning for image classification | 3 |
| 3 | 1338.0 | ml optimization using cognitive assistant | 4 |
| 4 | 1276.0 | deploy your python model as a restful api | 5 |
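As an aside, the same first-seen numbering can be sketched without an explicit loop using `pd.factorize`. This is an alternative I am assuming would be acceptable, not the mapping agreed with the company; the series `toy_emails` and the placeholder string `"__unknown__"` are hypothetical.

```python
import numpy as np
import pandas as pd

# Hypothetical email column with two missing entries.
toy_emails = pd.Series(["x@a", "y@b", np.nan, "x@a", np.nan])

# Assumption: all missing emails belong to one anonymous user,
# so they are replaced by a single placeholder before encoding.
filled = toy_emails.fillna("__unknown__")

# factorize assigns 0, 1, 2, ... in order of first appearance;
# adding 1 gives ids that start at 1.
codes, uniques = pd.factorize(filled)
user_ids = codes + 1
print(user_ids.tolist())
```

Filling the missing values first avoids relying on how `factorize` treats NaN, which would otherwise be given a sentinel code rather than its own id.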
Unlike in the earlier lessons, we don't actually have ratings for whether a user liked an article or not. We only know that a user has interacted with an article. In these cases, the popularity of an article can really only be based on how often an article was interacted with.
1. The functions below return the titles and the ids of the n articles with the most interactions.
def get_top_articles(n, df=df):
    '''
    INPUT:
    n - (int) the number of top articles to return
    df - (pandas dataframe) df as defined at the top of the notebook
    OUTPUT:
    top_articles - (list) A list of titles of the top 'n' articles
    '''
    # Get the ids of the n most interacted-with articles
    top_n_article_ids = df['article_id'].value_counts().index[:n]
    # Get the titles of the articles with ids in top_n_article_ids.
    # Note: unique() keeps the order of first appearance in df, so the titles
    # are not ranked by popularity and do not line up position by position
    # with the ids returned by get_top_article_ids.
    top_article_titles = df[df['article_id'].isin(top_n_article_ids)]['title'].unique().tolist()
    # Return the top article titles from df (not df_content)
    return top_article_titles

def get_top_article_ids(n, df=df):
    '''
    INPUT:
    n - (int) the number of top articles to return
    df - (pandas dataframe) df as defined at the top of the notebook
    OUTPUT:
    top_articles - (list) A list of the ids of the top 'n' articles,
                   ordered from most to fewest interactions
    '''
    top_article_ids = df['article_id'].value_counts().index[:n].tolist()
    # Return the top article ids
    return top_article_ids
print(get_top_articles(10))
print(get_top_article_ids(10))
['healthcare python streaming application demo', 'use deep learning for image classification', 'apache spark lab, part 1: basic concepts', 'predicting churn with the spss random tree algorithm', 'analyze energy consumption in buildings', 'visualize car data with brunel', 'use xgboost, scikit-learn & ibm watson machine learning apis', 'gosales transactions for logistic regression model', 'insights from new york car accident reports', 'finding optimal locations of new store using decision optimization'] [1429.0, 1330.0, 1431.0, 1427.0, 1364.0, 1314.0, 1293.0, 1170.0, 1162.0, 1304.0]
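In the output above, the titles come back in the order they first appear in df, while the ids are ranked by interaction count, so the two lists do not pair up position by position. A minimal sketch of one way to keep them aligned is shown below; the helper `top_ids_and_titles` and the dataframe `toy` are hypothetical and not part of the notebook.

```python
import pandas as pd

# Hypothetical interaction log: article 20 is the most popular,
# but article 30 appears first in the data.
toy = pd.DataFrame({
    "article_id": [30.0, 20.0, 20.0, 30.0, 20.0, 10.0],
    "title": ["thirty", "twenty", "twenty", "thirty", "twenty", "ten"],
})

def top_ids_and_titles(n, df):
    """Return the n most-interacted article ids and their titles, both ranked."""
    # value_counts sorts by count, most frequent first.
    top_ids = df["article_id"].value_counts().index[:n].tolist()
    # One title per article id, used as a lookup table.
    id_to_title = df.drop_duplicates(subset="article_id").set_index("article_id")["title"]
    # .loc with a list preserves the order of the list.
    top_titles = id_to_title.loc[top_ids].tolist()
    return top_ids, top_titles

ids, titles = top_ids_and_titles(2, toy)
print(ids, titles)
```

Looking titles up by id, rather than filtering df and calling `unique()`, is what keeps the ranking intact.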
1. The function below reshapes the df dataframe into a matrix with users as the rows and articles as the columns, holding a 1 where a user interacted with an article (however many times) and a 0 otherwise.
# create the user-article matrix with 1's and 0's
def create_user_item_matrix(df):
    '''
    INPUT:
    df - pandas dataframe with article_id, title, user_id columns
    OUTPUT:
    user_item - user item matrix
    Description:
    Return a matrix with user ids as rows and article ids on the columns with 1 values where a user interacted with
    an article and a 0 otherwise
    '''
    # Remove all duplicates in the columns article_id and user_id
    df = df.drop_duplicates(subset=['article_id', 'user_id'])
    # Perform a simple cross tabulation, i.e., compute a frequency table of the columns user_id and article_id
    user_item = pd.crosstab(df['user_id'], df['article_id'])
    # return the user_item matrix
    return user_item
user_item = create_user_item_matrix(df)
## Tests: You should just need to run this cell. Don't change the code.
assert user_item.shape[0] == 5149, "Oops! The number of users in the user-article matrix doesn't look right."
assert user_item.shape[1] == 714, "Oops! The number of articles in the user-article matrix doesn't look right."
assert user_item.sum(axis=1)[1] == 36, "Oops! The number of articles seen by user 1 doesn't look right."
print("You have passed our quick tests! Please proceed!")
You have passed our quick tests! Please proceed!
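To see what the resulting matrix looks like, here is a minimal sketch of the same deduplicate-then-crosstab steps on a made-up log. The dataframe `toy` is hypothetical; user 1 views article 10 twice, yet the matrix still holds a 1 for that pair.

```python
import pandas as pd

# Hypothetical interaction log; user 1 views article 10.0 twice.
toy = pd.DataFrame({
    "user_id": [1, 1, 1, 2, 3],
    "article_id": [10.0, 10.0, 20.0, 20.0, 30.0],
})

# Drop repeated (user, article) pairs so every count is 0 or 1.
deduped = toy.drop_duplicates(subset=["user_id", "article_id"])

# Rows are users, columns are articles, entries are interaction flags.
toy_user_item = pd.crosstab(deduped["user_id"], deduped["article_id"])
print(toy_user_item)
```

Without the deduplication step, `crosstab` would return interaction counts instead of binary flags.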
2. The function below takes a user_id and provides an ordered list of the most similar users to that user (from most similar to least similar). The returned result does not contain the provided user_id, since every user is trivially similar to themselves. Because the entries for each user are binary, the dot product of two users' rows counts the articles they have both interacted with, which makes it a reasonable similarity measure here.
def find_similar_users(user_id, user_item=user_item):
    '''
    INPUT:
    user_id - (int) a user_id
    user_item - (pandas dataframe) matrix of users by articles:
                1's when a user has interacted with an article, 0 otherwise
    OUTPUT:
    similar_users - (list) an ordered list where the closest users (largest dot product users)
                    are listed first
    Description:
    Computes the similarity of the provided user to every other user based on the dot product
    Returns an ordered list of the other users' ids
    '''
    ## Compute similarity of each user to the provided user. This involves the following steps.
    # Get the data associated with the user whose id is `user_id`.
    # Note that this encodes the articles that the user has interacted with.
    user_data = user_item[user_item.index==user_id].values
    # Take the dot product of `user_data` with all the users in the dataframe `user_item`.
    # `user_item.values` is a matrix with users as rows and articles as columns, so taking
    # its transpose gives the dot product of `user_data` with every user's row.
    user_data_dot_all_users = np.dot(user_data, user_item.values.T)[0]
    ## sort by similarity
    # Get index of the dataframe object `user_item`
    users_indexes = user_item.index.values
    # Form a dataframe with `user_data_dot_all_users`. The positions of the entries of
    # `user_data_dot_all_users` correspond to those of `users_indexes`, so the similarity of
    # `user_id` to the i-th user, namely `users_indexes[i]`, is `user_data_dot_all_users[i]`.
    user_users_similarity_df = pd.DataFrame(user_data_dot_all_users, index=users_indexes, columns=['similarity'])
    # Sort the similarities of the users in decreasing order
    user_users_similarity_df.sort_values(by=['similarity'], ascending=False, inplace=True)
    ## remove the own user's id, that is, remove `user_id` from the index of `user_users_similarity_df`.
    user_users_similarity_df.drop(labels=[user_id], inplace=True)
    # Get the index of `user_users_similarity_df`, i.e., all users in `user_item` except `user_id`.
    most_similar_users = user_users_similarity_df.index.to_list()
    # Return a list of the users in order from most to least similar
    return most_similar_users
# Do a spot check of your function
print("The 10 most similar users to user 1 are: {}".format(find_similar_users(1)[:10]))
print("The 5 most similar users to user 3933 are: {}".format(find_similar_users(3933)[:5]))
print("The 3 most similar users to user 46 are: {}".format(find_similar_users(46)[:3]))
The 10 most similar users to user 1 are: [3933, 23, 3782, 203, 4459, 3870, 131, 4201, 46, 5041] The 5 most similar users to user 3933 are: [1, 23, 3782, 203, 4459] The 3 most similar users to user 46 are: [4201, 3782, 23]
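The dot-product similarity can be checked by hand on a small, made-up matrix. The helper `similar_users` and the matrix `toy_user_item` below are hypothetical; the sketch uses `DataFrame.dot` rather than `np.dot`, which is a different route to the same numbers.

```python
import pandas as pd

# Hypothetical binary user-item matrix: rows are users, columns are articles.
toy_user_item = pd.DataFrame(
    [[1, 1, 0, 1],
     [1, 1, 0, 0],
     [0, 0, 1, 0],
     [1, 0, 0, 0]],
    index=[1, 2, 3, 4],
    columns=[10.0, 20.0, 30.0, 40.0],
)

def similar_users(user_id, user_item):
    """Rank the other users by the number of articles shared with user_id."""
    # For binary rows, the dot product counts the articles both users have seen.
    sims = user_item.dot(user_item.loc[user_id])
    # Remove the user themselves, then list the most similar users first.
    return sims.drop(user_id).sort_values(ascending=False)

sims_to_1 = similar_users(1, toy_user_item)
print(sims_to_1)
```

User 2 shares two articles with user 1, user 4 shares one, and user 3 shares none, so the ranking is 2, 4, 3.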
3. Now that I have a function that provides the most similar users to each user, I want to use these users to find articles I can recommend. The functions below return the articles I would recommend to each user.
def get_article_names(article_ids, df=df):
    '''
    INPUT:
    article_ids - (list) a list of article ids
    df - (pandas dataframe) df as defined at the top of the notebook
    OUTPUT:
    article_names - (list) a list of article names associated with the list of article ids
                    (this is identified by the title column)
    '''
    # Convert article_ids to Python float objects
    article_ids = list(map(float, article_ids))
    # Create a dataframe from df with one row per article_id and its title
    unique_articles_df = df.drop_duplicates(subset='article_id')[['article_id', 'title']].set_index('article_id')
    # Get the title of each article_id, in the order given by article_ids
    article_names = unique_articles_df.loc[article_ids]['title'].tolist()
    # Return the article names associated with the list of article ids
    return article_names
def get_user_articles(user_id, user_item=user_item):
    '''
    INPUT:
    user_id - (int) a user id
    user_item - (pandas dataframe) matrix of users by articles:
                1's when a user has interacted with an article, 0 otherwise
    OUTPUT:
    article_ids - (numpy array) the article ids seen by the user
    article_names - (list) a list of article names associated with the article ids
                    (this is identified by the title column in df)
    Description:
    Provides the article_ids and article titles that have been seen by a user
    '''
    # Get the data associated with the user whose id is `user_id`
    user_data = user_item.loc[user_id]
    # Get the positions at which `user_id` has interacted with an article.
    user_art_inter_index = np.where(user_data == 1)
    # Get the `article_id`'s that user_id has interacted with
    user_article_ids = user_data.index[user_art_inter_index].values
    # Get the titles of the articles that user_id has interacted with
    article_names = get_article_names(user_article_ids)
    # Return the article ids and their names
    return user_article_ids, article_names
def user_user_recs(user_id, m=10):
    '''
    INPUT:
    user_id - (int) a user id
    m - (int) the number of recommendations you want for the user
    OUTPUT:
    recs - (list) a list of recommendations for the user
    Description:
    Loops through the users based on closeness to the input user_id
    For each user - finds articles the user hasn't seen before and provides them as recs
    Does this until m recommendations are found
    Notes:
    Users who are the same closeness are chosen arbitrarily as the 'next' user
    For the user where the number of recommended articles starts below m
    and ends exceeding m, the last items are chosen arbitrarily
    '''
    # recs will hold at most the first m article ids to recommend to user_id
    recs = []
    # Get the list of users ordered by similarity of article interactions to `user_id`
    similar_users = find_similar_users(user_id)
    # Get the article ids and the article names that user_id has interacted with
    user_articles_ids, user_articles_names = get_user_articles(user_id)
    # For each user in similar_users
    for similar_user in similar_users:
        # Get the article ids and the article names that similar_user has interacted with.
        similar_user_article_ids, similar_user_article_names = get_user_articles(similar_user)
        # Keep the articles that user_id has not interacted with and that are not already in recs
        rec_article_ids = set(similar_user_article_ids).difference(set(user_articles_ids))
        rec_article_ids = rec_article_ids.difference(set(recs))
        # Update recs with `rec_article_ids`
        recs.extend(list(rec_article_ids))
        # Stop once `recs` holds at least `m` article ids
        if len(recs) >= m:
            break
    # Keep only the first `m` elements of `recs`
    recs = recs[:m]
    # return the `article_id`'s to be recommended for this user_id
    return recs
# Check Results
get_article_names(user_user_recs(1, 10)) # Return 10 recommendations for user 1
['data tidying in data science experience', 'this week in data science (april 18, 2017)', 'shaping data with ibm data refinery', 'fertility rate by country in total births per woman', 'data science platforms are on the rise and ibm is leading the way', 'timeseries data analysis of iot events by using jupyter notebook', 'got zip code data? prep it for analytics. – ibm watson data lab – medium', 'higher-order logistic regression for large datasets', 'from scikit-learn model to cloud with wml client', 'from spark ml model to online scoring with scala']
# Test your functions here
assert set(get_article_names(['1024.0', '1176.0', '1305.0', '1314.0', '1422.0', '1427.0'])) == set(['using deep learning to reconstruct high-resolution audio', 'build a python app on the streaming analytics service', 'gosales transactions for naive bayes model', 'healthcare python streaming application demo', 'use r dataframes & ibm watson natural language understanding', 'use xgboost, scikit-learn & ibm watson machine learning apis']), "Oops! Your the get_article_names function doesn't work quite how we expect."
assert set(get_article_names(['1320.0', '232.0', '844.0'])) == set(['housing (2015): united states demographic measures','self-service data preparation with ibm data refinery','use the cloudant-spark connector in python notebook']), "Oops! Your the get_article_names function doesn't work quite how we expect."
assert set(get_user_articles(20)[0]) == set(list(map(float, ['1320.0', '232.0', '844.0'])))
assert set(get_user_articles(20)[1]) == set(['housing (2015): united states demographic measures', 'self-service data preparation with ibm data refinery','use the cloudant-spark connector in python notebook'])
assert set(get_user_articles(2)[0]) == set(list(map(float, ['1024.0', '1176.0', '1305.0', '1314.0', '1422.0', '1427.0'])))
assert set(get_user_articles(2)[1]) == set(['using deep learning to reconstruct high-resolution audio', 'build a python app on the streaming analytics service', 'gosales transactions for naive bayes model', 'healthcare python streaming application demo', 'use r dataframes & ibm watson natural language understanding', 'use xgboost, scikit-learn & ibm watson machine learning apis'])
print("If this is all you see, you passed all of our tests! Nice job!")
If this is all you see, you passed all of our tests! Nice job!
4. Let us improve the consistency of the user_user_recs function from above.
def get_top_sorted_users(user_id, df=df, user_item=user_item):
    '''
    INPUT:
    user_id - (int)
    df - (pandas dataframe) df as defined at the top of the notebook
    user_item - (pandas dataframe) matrix of users by articles:
                1's when a user has interacted with an article, 0 otherwise
    OUTPUT:
    neighbors_df - (pandas dataframe) a dataframe with:
                   neighbor_id - is a neighbor user_id
                   similarity - measure of the similarity of each user to the provided user_id
                   num_interactions - the number of articles viewed by the user - if a u
    Other Details - sort the neighbors_df by the similarity and then by number of interactions where
                    highest of each is higher in the dataframe
    '''
    ## Compute similarity of each user to the provided user. This involves the following steps.
    # Get the data associated with the user whose id is `user_id`.
    # Note that this encodes the articles that the user has interacted with.
    user_data = user_item[user_item.index==user_id].values
    # Take the dot product of `user_data` with all the users in the dataframe `user_item`.
    # `user_item.values` is a matrix with users as rows and articles as columns, so taking
    # its transpose gives the dot product of `user_data` with every user's row.
    user_data_dot_all_users = np.dot(user_data, user_item.values.T)[0]
    ## sort by similarity
    # Get index of the dataframe object `user_item`
    users_indexes = user_item.index.values
    # Form a Series object with `user_data_dot_all_users`. The positions of the entries of
    # `user_data_dot_all_users` correspond to those of `users_indexes`, so the similarity of
    # `user_id` to the i-th user, namely `users_indexes[i]`, is `user_data_dot_all_users[i]`.
    user_users_similarity = pd.Series(user_data_dot_all_users, index=users_indexes)
    # Get a count of the article interactions of each user.
    users_articles_intertns = df.groupby(['user_id'])['article_id'].count()
    # Take only the users in `users_indexes`
    users_articles_intertns = users_articles_intertns.loc[users_indexes]
    # Form a dataframe with users_indexes as index and user_users_similarity and
    # users_articles_intertns as columns.
    neighbors_df = pd.DataFrame({'neighbor_id': users_indexes, 'similarity': user_users_similarity,
                                 'num_interactions': users_articles_intertns}).set_index('neighbor_id')
    # Remove the row that corresponds to `user_id` from the dataframe `neighbors_df`.
    neighbors_df.drop(labels=[user_id], inplace=True)
    # Sort the rows of `neighbors_df` by similarity, then by number of interactions, both descending.
    neighbors_df.sort_values(by=['similarity', 'num_interactions'], ascending=[False, False], inplace=True)
    # Return the dataframe specified in the docstring
    return neighbors_df
def user_user_recs_part2(user_id, m=10):
    '''
    INPUT:
    user_id - (int) a user id
    m - (int) the number of recommendations you want for the user
    OUTPUT:
    recs - (list) a list of recommendations for the user by article id
    rec_names - (list) a list of recommendations for the user by article title
    Description:
    Loops through the users based on closeness to the input user_id
    For each user - finds articles the user hasn't seen before and provides them as recs
    Does this until m recommendations are found
    Notes:
    * Choose the users that have the most total article interactions
    before choosing those with fewer article interactions.
    * Choose the articles with the most total interactions
    before choosing those with fewer total interactions.
    '''
    try:
        # recs will hold at most the first m article ids to recommend to user_id
        recs = []
        # Get the users similar to `user_id`
        top_users_df = get_top_sorted_users(user_id)
        # Get the ids of the users similar to `user_id`
        similar_users = top_users_df.index.values
        # Get the article ids and the article names that user_id has interacted with
        user_articles_ids, user_articles_names = get_user_articles(user_id)
        # Get the number of user interactions per article
        articles_users_intertns = df.groupby(['article_id'])['user_id'].count()
        # For each user in similar_users
        for similar_user in similar_users:
            # Get the article ids and the article names that similar_user has interacted with.
            similar_user_article_ids, similar_user_article_names = get_user_articles(similar_user)
            # Keep the articles that user_id has not interacted with and that are not already in recs
            rec_article_ids = set(similar_user_article_ids).difference(user_articles_ids).difference(recs)
            # Sort the candidate articles by number of interactions, most popular first
            rec_article_ids_sorted = articles_users_intertns.loc[list(rec_article_ids)].sort_values(ascending=False).index.to_list()
            # Update recs with `rec_article_ids_sorted`
            recs.extend(rec_article_ids_sorted)
            # Stop once `recs` holds at least `m` article ids
            if len(recs) >= m:
                break
        # Keep only the first `m` elements of `recs`
        recs = recs[:m]
        rec_names = get_article_names(recs)
    except (KeyError, IndexError):
        # user_id is not in user_item (e.g. a new user): fall back to the most popular articles
        recs = get_top_article_ids(m, df=df)
        rec_names = get_top_articles(m, df=df)
    return recs, rec_names
# Quick spot check - don't change this code - just use it to test your functions
rec_ids, rec_names = user_user_recs_part2(20, 10)
print("The top 10 recommendations for user 20 are the following article ids:")
print(rec_ids)
print()
print("The top 10 recommendations for user 20 are the following article names:")
print(rec_names)
The top 10 recommendations for user 20 are the following article ids:
[1330.0, 1427.0, 1364.0, 1170.0, 1162.0, 1304.0, 1351.0, 1160.0, 1354.0, 1368.0]

The top 10 recommendations for user 20 are the following article names:
['insights from new york car accident reports', 'use xgboost, scikit-learn & ibm watson machine learning apis', 'predicting churn with the spss random tree algorithm', 'apache spark lab, part 1: basic concepts', 'analyze energy consumption in buildings', 'gosales transactions for logistic regression model', 'model bike sharing data with spss', 'analyze accident reports on amazon emr spark', 'movie recommender system with spark machine learning', 'putting a human face on machine learning']
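To make the ranking step inside user_user_recs_part2 concrete, here is a minimal sketch on a hypothetical toy interaction log. The data and the helper `unseen_sorted` are illustrative assumptions and not part of the notebook's own functions; the sketch only shows how one neighbour's unseen articles are ordered by total interactions.

```python
from collections import Counter

# Hypothetical toy interaction log: (user_id, article_id) pairs.
interactions = [
    (1, "a"), (1, "b"),
    (2, "a"), (2, "c"), (2, "d"),
    (3, "c"), (3, "d"), (3, "e"),
    (4, "d"),
]

# Total interactions per article, the analogue of `articles_users_intertns`.
article_counts = Counter(article for _, article in interactions)

def unseen_sorted(target_seen, neighbour_seen, counts):
    """Articles the neighbour read that the target has not, most popular first."""
    candidates = set(neighbour_seen) - set(target_seen)
    # Sort by descending count; break ties by article id so the order is deterministic.
    return sorted(candidates, key=lambda a: (-counts[a], a))

seen_by_1 = {a for u, a in interactions if u == 1}
seen_by_2 = {a for u, a in interactions if u == 2}
print(unseen_sorted(seen_by_1, seen_by_2, article_counts))  # ['d', 'c']
```

Article "d" comes first because it has three interactions in total, against two for "c".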
5. Let's test our function by finding some of the users most similar to users 1 and 131.
# Find the user that is most similar to user 1
user1_most_sim = get_top_sorted_users(1).head(1).index[0]
# Find the 10th most similar user to user 131
user131_10th_sim = get_top_sorted_users(131).head(10).index[-1]
assert user1_most_sim == 3933
assert user131_10th_sim == 242
print("If this is all you see, you passed all of our tests! Nice job!")
If this is all you see, you passed all of our tests! Nice job!
6. Suppose a new user joins the platform. Which articles should we recommend to them?
A new user has no interaction history, so there is nothing to compare against other users. I would therefore recommend the 10 most-read articles.
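The fallback described above can be sketched as follows. This is a minimal illustration on hypothetical data; `top_articles`, `recommend`, and the toy `log` are assumed names for this example only, and the branch for known users is a simple placeholder rather than the collaborative filtering used elsewhere in this notebook.

```python
from collections import Counter

def top_articles(interactions, n):
    """Return the n most-read article ids, ties broken by article id."""
    counts = Counter(article for _, article in interactions)
    return sorted(counts, key=lambda a: (-counts[a], a))[:n]

def recommend(user_id, interactions, n=2):
    """Hypothetical dispatcher: popularity for unknown users, unseen articles otherwise."""
    known_users = {user for user, _ in interactions}
    if user_id not in known_users:
        # Cold start: no history to compare against, so fall back to popularity.
        return top_articles(interactions, n)
    seen = {a for u, a in interactions if u == user_id}
    # Placeholder for a personalised step: popular articles the user has not seen.
    unseen = [(u, a) for u, a in interactions if a not in seen]
    return top_articles(unseen, n)

# Hypothetical toy log of (user_id, article_id) pairs.
log = [(1, 1429), (2, 1429), (3, 1429), (1, 1330), (2, 1330), (3, 1314)]
print(recommend(0, log))  # user 0 has no history -> [1429, 1330]
print(recommend(1, log))  # user 1 has already read 1429 and 1330 -> [1314]
```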
7. Let's use our existing functions to provide the top 10 recommended articles for a new user.
new_user = 0.0
recs_article_ids, rec_article_names = user_user_recs_part2(new_user, 10)
print("The top 10 articles that I will recommend for a new user are the following:")
pd.DataFrame({"article_id":recs_article_ids,
"article_names": rec_article_names
})
The top 10 articles that I will recommend for a new user are the following:
| | article_id | article_names |
|---|---|---|
| 0 | 1429.0 | use deep learning for image classification |
| 1 | 1330.0 | insights from new york car accident reports |
| 2 | 1431.0 | visualize car data with brunel |
| 3 | 1427.0 | use xgboost, scikit-learn & ibm watson machine... |
| 4 | 1364.0 | predicting churn with the spss random tree alg... |
| 5 | 1314.0 | healthcare python streaming application demo |
| 6 | 1293.0 | finding optimal locations of new store using d... |
| 7 | 1170.0 | apache spark lab, part 1: basic concepts |
| 8 | 1162.0 | analyze energy consumption in buildings |
| 9 | 1304.0 | gosales transactions for logistic regression m... |
In this part of the notebook, I will use matrix factorization to make article recommendations to the users on the IBM Watson Studio platform using the data given in the dataframe user_item.
# Load the matrix here
user_item_matrix = user_item
user_item_matrix.head()
| article_id | 0.0 | 2.0 | 4.0 | 8.0 | 9.0 | 12.0 | 14.0 | 15.0 | 16.0 | 18.0 | ... | 1434.0 | 1435.0 | 1436.0 | 1437.0 | 1439.0 | 1440.0 | 1441.0 | 1442.0 | 1443.0 | 1444.0 |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| user_id | |||||||||||||||||||||
| 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | ... | 0 | 0 | 1 | 0 | 1 | 0 | 0 | 0 | 0 | 0 |
| 2 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| 3 | 0 | 0 | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 0 | ... | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| 4 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| 5 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
5 rows × 714 columns
# The number of zeros in the user_item_matrix data
(user_item_matrix == 0).sum().sum()
3642704
# The number of ones in the user_item_matrix data
(user_item_matrix == 1).sum().sum()
33682
The outputs of the two cells above indicate that the user_item_matrix is extremely sparse: zeros are almost everywhere. In particular, $99.08\%$ of the entries in the user_item_matrix are zero, while the remaining $0.92\%$ are ones. Since zeros and ones together account for every entry, there are no missing values in the user_item_matrix dataframe.
# The user_item_matrix has no missing values
user_item_matrix.isna().sum().sum()
0
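As a quick check of the percentages quoted above, the arithmetic can be reproduced from the two printed counts. The variable names are illustrative; the only inputs are the counts of zeros and ones and the 714 columns reported for user_item_matrix.

```python
# Counts printed by the two cells above.
zeros = 3_642_704
ones = 33_682

total = zeros + ones
zero_share = 100 * zeros / total
one_share = 100 * ones / total
print(f"{total:,} entries: {zero_share:.2f}% zeros, {one_share:.2f}% ones")
# 3,676,386 entries: 99.08% zeros, 0.92% ones

# With 714 article columns, the total corresponds to a whole number of user rows.
n_columns = 714
print(total // n_columns, total % n_columns)  # 5149 0
```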
2. Let's use the Singular Value Decomposition from numpy on the user_item_matrix.
# Perform SVD on the User-Item Matrix Here
u, s, vt = np.linalg.svd(user_item_matrix) # use the built in to get the three matrices
Note that the np.linalg.svd function runs without any errors because user_item_matrix does not have any missing values. If it had missing values, we would have to resort to the FunkSVD algorithm.
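For comparison, here is a minimal sketch of the FunkSVD idea, assuming missing entries are marked with NaN. It is not needed for this dataset, and the function names, the toy matrix, and the hyperparameters (`latent_features`, `learning_rate`, `iters`) are illustrative assumptions. The key point is that gradient descent only visits the observed cells, so missing values never enter the computation.

```python
import numpy as np

def funk_svd(ratings, latent_features=2, learning_rate=0.01, iters=500, seed=0):
    """Factor `ratings` into user and item matrices, training on observed cells only."""
    rng = np.random.default_rng(seed)
    n_users, n_items = ratings.shape
    user_mat = rng.random((n_users, latent_features))
    item_mat = rng.random((latent_features, n_items))
    # Indices of the cells that actually hold a value (NaN marks a missing one).
    observed = [(i, j) for i in range(n_users) for j in range(n_items)
                if not np.isnan(ratings[i, j])]
    for _ in range(iters):
        for i, j in observed:
            err = ratings[i, j] - user_mat[i, :] @ item_mat[:, j]
            user_row = user_mat[i, :].copy()
            # Gradient step on the squared error of this single cell.
            user_mat[i, :] += learning_rate * 2 * err * item_mat[:, j]
            item_mat[:, j] += learning_rate * 2 * err * user_row
    return user_mat, item_mat

def observed_sse(ratings, user_mat, item_mat):
    """Sum of squared errors over the observed cells."""
    mask = ~np.isnan(ratings)
    return float(np.sum((ratings - user_mat @ item_mat)[mask] ** 2))

# Hypothetical toy matrix with three missing entries.
toy = np.array([[1.0, 0.0, np.nan],
                [np.nan, 1.0, 0.0],
                [1.0, np.nan, 1.0]])

init_user, init_item = funk_svd(toy, iters=0)   # random starting point
fit_user, fit_item = funk_svd(toy, iters=500)   # same start, after training
print(observed_sse(toy, init_user, init_item), observed_sse(toy, fit_user, fit_item))
```

The product `fit_user @ fit_item` has a value in every cell, including the ones that were missing in `toy`, which is what makes the approach usable for prediction.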
3. Now for the tricky part: how do we choose the number of latent features to use? The code in the next cell shows that increasing the number of latent features decreases the error rate of the predictions for the 1 and 0 values in the user_item_matrix dataframe. That gives us an idea of how the accuracy improves as we increase the number of latent features.
num_latent_feats = np.arange(10,700+10,20)
sum_errs = []
for k in num_latent_feats:
    # restructure with k latent features
    s_new, u_new, vt_new = np.diag(s[:k]), u[:, :k], vt[:k, :]
    # take dot product
    user_item_est = np.around(np.dot(np.dot(u_new, s_new), vt_new))
    # compute error for each prediction to actual value
    diffs = np.subtract(user_item_matrix, user_item_est)
    # total errors and keep track of them
    err = np.sum(np.sum(np.abs(diffs)))
    sum_errs.append(err)
## Plotting with the matplotlib library
#plt.plot(num_latent_feats, 1 - np.array(sum_errs)/df.shape[0]);
#plt.xlabel('Number of Latent Features');
#plt.ylabel('Accuracy');
#plt.title('Accuracy vs. Number of Latent Features');
## Plotting with the plotly library
# Note: the errors are normalised by the number of interaction rows in `df`,
# not by the number of cells in `user_item_matrix`.
graph_fig = px.line(x=num_latent_feats,
                    y=1 - np.array(sum_errs)/df.shape[0],
                    title="Accuracy vs. Number of Latent Features",
                    template="plotly_dark"
                    )
graph_fig.update_layout(title_x=0.5,
                        xaxis_title="Number of Latent Features",
                        yaxis_title="Accuracy",
                        )
graph_fig.show(config={'displaylogo': False})
4. From the above, we can't really be sure how many features to use, because being better at predicting the 1's and 0's of the matrix doesn't tell us whether we are able to make good recommendations. Instead, we can split our dataset into a training set and a test set, as shown in the cell below.
In the subsequent cells, I will reuse the code from question 3 to understand how the number of latent features affects the accuracy on the training and test sets, using the split below:
df_train = df.head(40000)
df_test = df.tail(5993)
def create_test_and_train_user_item(df_train, df_test):
    '''
    INPUT:
    df_train - training dataframe
    df_test - test dataframe
    OUTPUT:
    user_item_train - a user-item matrix of the training dataframe
                      (unique users for each row and unique articles for each column)
    user_item_test - a user-item matrix of the testing dataframe
                     (unique users for each row and unique articles for each column)
    test_idx - all of the test user ids
    test_arts - all of the test article ids
    '''
    user_item_train = create_user_item_matrix(df_train)
    user_item_test = create_user_item_matrix(df_test)
    test_idx = user_item_test.index.values
    test_arts = user_item_test.columns.values
    return user_item_train, user_item_test, test_idx, test_arts
user_item_train, user_item_test, test_idx, test_arts = create_test_and_train_user_item(df_train, df_test)
# Number of users we can predict in the test set
train_idx = user_item_train.index.values
num_test_users_pred = len(np.intersect1d(test_idx, train_idx))
num_test_users_pred
20
# Number of users in the test set that we are not able to make predictions for because of the cold start problem
num_unpred_test_users = len(test_idx) - num_test_users_pred
num_unpred_test_users
662
# Number of articles we can make predictions for in the test set
train_art_idx = user_item_train.columns.values
test_art_idx = user_item_test.columns.values
num_test_arts_pred = len(np.intersect1d(test_art_idx, train_art_idx))
num_test_arts_pred
574
# Number of articles in the test set we are not able to make predictions for because of the cold start problem
num_unpred_test_arts = len(test_art_idx) - num_test_arts_pred
num_unpred_test_arts
0
5. Now I use the user_item_train dataset from above to find U, S, and V transpose using SVD. Then I find the subset of rows in the user_item_test dataset that I can predict using this matrix decomposition, and I try different numbers of latent features to see how many it makes sense to keep based on the accuracy on the test data. Note that this requires combining what I did in questions 2 - 4 of this section.
Let us explore how well SVD works towards making predictions for recommendations on the test data.
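Before running this on the real matrices, the idea can be illustrated on a hypothetical toy example. The ids and values below are made up, and `predict` is an assumed helper name: the decomposition is fitted on the training matrix only, and test predictions are read off the rows of U and the columns of V transpose that belong to users and articles seen in training.

```python
import numpy as np

# Hypothetical toy training matrix: rows are users 10-13, columns are articles 0-2.
train_users = np.array([10, 11, 12, 13])
train_arts = np.array([0, 1, 2])
train = np.array([[1, 0, 1],
                  [0, 1, 0],
                  [1, 1, 0],
                  [0, 0, 1]], dtype=float)

# Hypothetical test ids: user 99 and article 7 never appear in training,
# so no prediction can be made for them (the cold start problem).
test_users = np.array([11, 13, 99])
test_arts = np.array([0, 2, 7])

u, s, vt = np.linalg.svd(train, full_matrices=False)

# Keep only the rows of U and the columns of V^T that also occur in the test set.
row_mask = np.isin(train_users, test_users)
col_mask = np.isin(train_arts, test_arts)

def predict(k):
    """Rounded predictions for the overlapping users and articles with k latent features."""
    return np.around(u[row_mask, :k] @ np.diag(s[:k]) @ vt[:k, col_mask])

print(predict(1))
print(predict(3))  # with all 3 features this reproduces the training entries
```

With all three latent features the output equals the training values for users 11 and 13 on articles 0 and 2. Those are compared against the test matrix, which is why a perfect fit on the training data does not guarantee good test scores.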
# fit SVD on the user_item_train matrix
u_train, s_train, vt_train = np.linalg.svd(user_item_train) # fit svd similar to above then use the cells below
# Get the dimensions of u_train, s_train, and vt_train
u_train.shape, s_train.shape, vt_train.shape
((4487, 4487), (714,), (714, 714))
Let's see how well we can use the training decomposition to predict on the test data.
# find subset of u_train that are in test_idx
u_test = u_train[user_item_train.index.isin(test_idx),:]
u_test.shape
(20, 4487)
# find subset of vt_train that are in test_arts
vt_test = vt_train[:,user_item_train.columns.isin(test_arts)]
vt_test.shape
(714, 574)
# get subset of user_item_test that we can predict
test_users_w_pred_idx = np.intersect1d(test_idx, train_idx)
#test_users_w_pred_idx
user_item_test_subset = user_item_test[user_item_test.index.isin(test_users_w_pred_idx)]
user_item_test_subset.shape
(20, 574)
num_latent_feats = np.arange(10,700+10,20)
# For accuracy scores
train_accuracy_array = np.array([])
test_accuracy_array = np.array([])
# For F1 scores
train_f1_array = np.array([])
test_f1_array = np.array([])
# For precision scores
train_prec_array = np.array([])
test_prec_array = np.array([])
# For recall scores
train_recall_array = np.array([])
test_recall_array = np.array([])
# For each number of latent features in `num_latent_feats`
for k in num_latent_feats:
    # Take only k latent features for both training and testing data
    new_u_train, new_s_train, new_vt_train = u_train[:,:k], np.diag(s_train[:k]), vt_train[:k,:]
    new_u_test, new_vt_test = u_test[:,:k], vt_test[:k,:]
    # Multiply the truncated factors to get the predicted values for the training and test data.
    user_item_train_pred = np.around(np.dot(np.dot(new_u_train, new_s_train), new_vt_train))
    user_item_test_pred = np.around(np.dot(np.dot(new_u_test, new_s_train), new_vt_test))
    # Force the predictions into binary form, i.e., either 0 or 1, because
    # we are predicting whether or not a user will interact with an article.
    user_item_train_pred = np.clip(user_item_train_pred, 0, 1)
    user_item_test_pred = np.clip(user_item_test_pred, 0, 1)
    # Get the accuracy score for k latent features
    accuracy_train = accuracy_score(np.array(user_item_train).flatten(), user_item_train_pred.flatten())
    accuracy_test = accuracy_score(np.array(user_item_test_subset).flatten(), user_item_test_pred.flatten())
    #print("Accuracy Score for Training Data: {}\nAccuracy Score for Test Data: {}\n\n".format(accuracy_train, accuracy_test))
    # Get the F1 score for k latent features
    f1_train = f1_score(np.array(user_item_train).flatten(), user_item_train_pred.flatten())
    f1_test = f1_score(np.array(user_item_test_subset).flatten(), user_item_test_pred.flatten())
    #print("F1-Score for Training Data: {}\nF1-Score for Test Data: {}\n\n".format(f1_train, f1_test))
    # Get the precision score for k latent features
    prec_train = precision_score(np.array(user_item_train).flatten(), user_item_train_pred.flatten())
    prec_test = precision_score(np.array(user_item_test_subset).flatten(), user_item_test_pred.flatten())
    #print("Precision Score for Training Data: {}\nPrecision Score for Test Data: {}\n\n".format(prec_train, prec_test))
    # Get the recall score for k latent features
    recall_train = recall_score(np.array(user_item_train).flatten(), user_item_train_pred.flatten())
    recall_test = recall_score(np.array(user_item_test_subset).flatten(), user_item_test_pred.flatten())
    #print("Recall Score for Training Data: {}\nRecall Score for Test Data: {}\n\n".format(recall_train, recall_test))
    # Update `train_accuracy_array` and `test_accuracy_array` with `accuracy_train` and `accuracy_test` respectively
    train_accuracy_array = np.append(train_accuracy_array, accuracy_train)
    test_accuracy_array = np.append(test_accuracy_array, accuracy_test)
    # Update `train_f1_array` and `test_f1_array` with `f1_train` and `f1_test` respectively
    train_f1_array = np.append(train_f1_array, f1_train)
    test_f1_array = np.append(test_f1_array, f1_test)
    # Update `train_prec_array` and `test_prec_array` with `prec_train` and `prec_test` respectively
    train_prec_array = np.append(train_prec_array, prec_train)
    test_prec_array = np.append(test_prec_array, prec_test)
    # Update `train_recall_array` and `test_recall_array` with `recall_train` and `recall_test` respectively
    train_recall_array = np.append(train_recall_array, recall_train)
    test_recall_array = np.append(test_recall_array, recall_test)
metric_scores_df = pd.DataFrame({"latent_features_num": num_latent_feats,
                                 "train_accuracy_score": train_accuracy_array,
                                 "test_accuracy_score": test_accuracy_array,
                                 "train_f1_score": train_f1_array,
                                 "test_f1_score": test_f1_array,
                                 "train_prec_score": train_prec_array,
                                 "test_prec_score": test_prec_array,
                                 "train_recall_score": train_recall_array,
                                 "test_recall_score": test_recall_array
                                 }).set_index("latent_features_num")
metric_scores_df
| latent_features_num | train_accuracy_score | test_accuracy_score | train_f1_score | test_f1_score | train_prec_score | test_prec_score | train_recall_score | test_recall_score |
|---|---|---|---|---|---|---|---|---|
| 10 | 0.991735 | 0.978397 | 0.202458 | 0.074627 | 0.853479 | 0.200000 | 0.114851 | 0.045872 |
| 30 | 0.993505 | 0.976568 | 0.456642 | 0.123779 | 0.968006 | 0.213483 | 0.298797 | 0.087156 |
| 50 | 0.994605 | 0.975348 | 0.583418 | 0.129231 | 0.989939 | 0.196262 | 0.413580 | 0.096330 |
| 70 | 0.995586 | 0.973606 | 0.682032 | 0.126801 | 0.997173 | 0.170543 | 0.518248 | 0.100917 |
| 90 | 0.996367 | 0.972300 | 0.751860 | 0.131148 | 0.999547 | 0.162162 | 0.602549 | 0.110092 |
| 110 | 0.996959 | 0.970470 | 0.800352 | 0.124031 | 0.999591 | 0.142012 | 0.667339 | 0.110092 |
| 130 | 0.997482 | 0.969512 | 0.840130 | 0.125000 | 0.999859 | 0.137363 | 0.724405 | 0.114679 |
| 150 | 0.997898 | 0.968728 | 0.869968 | 0.126521 | 0.999956 | 0.134715 | 0.769888 | 0.119266 |
| 170 | 0.998228 | 0.968031 | 0.892564 | 0.128266 | 1.000000 | 0.133005 | 0.805973 | 0.123853 |
| 190 | 0.998512 | 0.967334 | 0.911330 | 0.125874 | 1.000000 | 0.127962 | 0.837104 | 0.123853 |
| 210 | 0.998746 | 0.966638 | 0.926308 | 0.123570 | 1.000000 | 0.123288 | 0.862732 | 0.123853 |
| 230 | 0.998954 | 0.966289 | 0.939287 | 0.122449 | 1.000000 | 0.121076 | 0.885525 | 0.123853 |
| 250 | 0.999141 | 0.965767 | 0.950660 | 0.120805 | 1.000000 | 0.117904 | 0.905960 | 0.123853 |
| 270 | 0.999287 | 0.965418 | 0.959373 | 0.119734 | 1.000000 | 0.115880 | 0.921918 | 0.123853 |
| 290 | 0.999404 | 0.965157 | 0.966302 | 0.118943 | 1.000000 | 0.114407 | 0.934800 | 0.123853 |
| 310 | 0.999500 | 0.964808 | 0.971840 | 0.117904 | 1.000000 | 0.112500 | 0.945223 | 0.123853 |
| 330 | 0.999580 | 0.964808 | 0.976443 | 0.117904 | 1.000000 | 0.112500 | 0.953971 | 0.123853 |
| 350 | 0.999645 | 0.964634 | 0.980189 | 0.117391 | 1.000000 | 0.111570 | 0.961147 | 0.123853 |
| 370 | 0.999709 | 0.964634 | 0.983836 | 0.117391 | 1.000000 | 0.111570 | 0.968186 | 0.123853 |
| 390 | 0.999781 | 0.964460 | 0.987843 | 0.116883 | 1.000000 | 0.110656 | 0.975977 | 0.123853 |
| 410 | 0.999817 | 0.964460 | 0.989886 | 0.116883 | 1.000000 | 0.110656 | 0.979975 | 0.123853 |
| 430 | 0.999851 | 0.964460 | 0.991766 | 0.116883 | 1.000000 | 0.110656 | 0.983666 | 0.123853 |
| 450 | 0.999879 | 0.964460 | 0.993344 | 0.116883 | 1.000000 | 0.110656 | 0.986776 | 0.123853 |
| 470 | 0.999907 | 0.964460 | 0.994865 | 0.116883 | 1.000000 | 0.110656 | 0.989783 | 0.123853 |
| 490 | 0.999928 | 0.964460 | 0.996038 | 0.116883 | 1.000000 | 0.110656 | 0.992106 | 0.123853 |
| 510 | 0.999943 | 0.964460 | 0.996863 | 0.116883 | 1.000000 | 0.110656 | 0.993747 | 0.123853 |
| 530 | 0.999962 | 0.964460 | 0.997911 | 0.116883 | 1.000000 | 0.110656 | 0.995831 | 0.123853 |
| 550 | 0.999976 | 0.964460 | 0.998683 | 0.116883 | 1.000000 | 0.110656 | 0.997369 | 0.123853 |
| 570 | 0.999985 | 0.964460 | 0.999196 | 0.116883 | 1.000000 | 0.110656 | 0.998394 | 0.123853 |
| 590 | 0.999991 | 0.964460 | 0.999521 | 0.116883 | 1.000000 | 0.110656 | 0.999043 | 0.123853 |
| 610 | 0.999993 | 0.964460 | 0.999624 | 0.116883 | 1.000000 | 0.110656 | 0.999248 | 0.123853 |
| 630 | 0.999996 | 0.964460 | 0.999761 | 0.116883 | 1.000000 | 0.110656 | 0.999522 | 0.123853 |
| 650 | 0.999999 | 0.964460 | 0.999949 | 0.116883 | 1.000000 | 0.110656 | 0.999897 | 0.123853 |
| 670 | 1.000000 | 0.964460 | 1.000000 | 0.116883 | 1.000000 | 0.110656 | 1.000000 | 0.123853 |
| 690 | 1.000000 | 0.964460 | 1.000000 | 0.116883 | 1.000000 | 0.110656 | 1.000000 | 0.123853 |
def metric_plot(num_latent_feats=num_latent_feats,
                metric_train_data=train_accuracy_array,
                metric_test_data=test_accuracy_array,
                train_data_name="Train Accuracy",
                test_data_name="Test Accuracy",
                plot_title="Training and Testing Accuracy Scores versus Number of Latent Features"):
    """
    INPUT:
    - num_latent_feats - An array of integers of dimension 1-by-k holding the numbers of latent features.
    - metric_train_data - An array of floats of dimension 1-by-k holding the training scores from one of the following metrics:
                          `accuracy_score`, `f1_score`, `precision_score` and `recall_score`.
    - metric_test_data - An array of floats of dimension 1-by-k holding the test scores from one of the following metrics:
                         `accuracy_score`, `f1_score`, `precision_score` and `recall_score`.
    - train_data_name - A string used in the legend of the plot for the training data.
    - test_data_name - A string used in the legend of the plot for the testing data.
    - plot_title - A string which is the title of the plot.
    OUTPUT:
    - None - NoneType
    """
    # Create figure with secondary y-axis
    fig = make_subplots(specs=[[{"secondary_y": True}]])
    ## Add traces
    # For the metric on the training data
    fig.add_trace(
        go.Scatter(x=num_latent_feats,
                   y=metric_train_data,
                   name=train_data_name
                   ),
        secondary_y=False,
    )
    # For the metric on the test data
    fig.add_trace(
        go.Scatter(x=num_latent_feats,
                   y=metric_test_data,
                   name=test_data_name
                   ),
        secondary_y=True,
    )
    # Add figure title
    fig.update_layout(title_x=0.5,
                      title_text=plot_title,
                      template="plotly_dark"
                      )
    # Set x-axis title
    fig.update_xaxes(title_text="Number of Latent Features")
    # Set y-axes titles
    fig.update_yaxes(title_text="<b>" + train_data_name + "</b>", secondary_y=False)
    fig.update_yaxes(title_text="<b>" + test_data_name + "</b>", secondary_y=True)
    fig.show(config={'displaylogo': False})
# Plotting for accuracy scores and number of latent features
metric_plot()
# Plotting for F1 scores and number of latent features
metric_plot(metric_train_data=train_f1_array,
            metric_test_data=test_f1_array,
            train_data_name="Train F1-Score",
            test_data_name="Test F1-Score",
            plot_title="Training and Testing F1-Scores versus Number of Latent Features"
            )
# Plotting for precision scores and number of latent features
metric_plot(metric_train_data=train_prec_array,
            metric_test_data=test_prec_array,
            train_data_name="Train Precision Score",
            test_data_name="Test Precision Score",
            plot_title="Training and Testing Precision Scores versus Number of Latent Features"
            )
# Plotting for recall scores and number of latent features
metric_plot(metric_train_data=train_recall_array,
            metric_test_data=test_recall_array,
            train_data_name="Train Recall Score",
            test_data_name="Test Recall Score",
            plot_title="Training and Testing Recall Scores versus Number of Latent Features"
            )
6. Given the circumstances of my results, I discuss what I will do to determine if the recommendations I make with any of the above recommendation systems are an improvement to how users currently find articles.
The highly sparse nature of the user_item_matrix reflects a heavily imbalanced interaction pattern between the users and the articles: $99.08\%$ of the entries in the user_item_matrix are zero, while the remaining $0.92\%$ are ones. With such an imbalance, a model can reach a high accuracy simply by predicting the absence of interaction between the users and the articles, so accuracy says little about how well the ones are predicted. That explains why the accuracy score is much higher than the ${F_{1}}$-score on the test data. The results also point to over-fitting: on the training data every metric climbs towards $1$ as latent features are added, while the test accuracy falls steadily and the test ${F_{1}}$-score never exceeds $0.14$. To reduce the over-fitting, one can reduce the number of latent features, rebalance the classes (for example, by oversampling the minority class), use cross-validation, or get more data. The plot of the ${F_{1}}$-scores of both the training and testing data against the number of latent features indicates that using fewer latent features improves the ${F_{1}}$-score on the test data, but not substantially: it peaks at $13.11\%$ with 90 latent features, compared with $11.69\%$ at 690 and $7.46\%$ at 10. Thus, one may have to resort to the other methods mentioned above.
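The gap between accuracy and the ${F_{1}}$-score can be made concrete with a small calculation. The sketch below assumes a hypothetical predictor that outputs 0 for every cell of the full user_item_matrix and uses the counts of zeros and ones reported earlier; `binary_metrics` and `f1_from` are illustrative helper names, not scikit-learn functions.

```python
def f1_from(precision, recall):
    """Harmonic mean of precision and recall (0 when both are 0)."""
    total = precision + recall
    return 2 * precision * recall / total if total else 0.0

def binary_metrics(tp, fp, fn, tn):
    """Accuracy, precision, recall and F1 from the four confusion-matrix counts."""
    accuracy = (tp + tn) / (tp + fp + fn + tn)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    return accuracy, precision, recall, f1_from(precision, recall)

# All-zeros predictor: every 1 in the matrix becomes a false negative.
acc, prec, rec, f1 = binary_metrics(tp=0, fp=0, fn=33_682, tn=3_642_704)
print(f"accuracy={acc:.4f}, f1={f1:.4f}")  # accuracy=0.9908, f1=0.0000

# Consistency check against the table row for 90 latent features (test data).
print(round(f1_from(0.162162, 0.110092), 4))  # 0.1311
```

A predictor that never recommends anything would therefore score about $99\%$ accuracy, which is why the ${F_{1}}$-score is the more informative metric here.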
After splitting the data into training and test sets, only $20$ users appear in both of them. The test metrics are therefore computed on a very small sample and might not be an accurate representation of how well the recommendations are performing. As a remedy, one may want to collect more data.
The interaction data on its own is not very informative. For example, it is hard to determine whether users fail to interact with articles because they did not like them, because they did not get the opportunity to read them, or for some other reason. A better signal, I would suggest, is for users to rate the articles they read. In addition, a recommended article that a user has not yet interacted with counts as a wrong prediction, even though the user might well like it once they view it. Therefore, the low ${F_{1}}$-score for the test data does not necessarily imply that the recommendations were poor.
Finally, since the platform is online, one can also test the performance of the recommendations through an A/B test. Namely, half of the users would be assigned to the experimental group, which receives the new recommendations, while the other half would stay on the old recommendation system as the control group. In the experimental design, one could request that users rate the articles they read. In addition, the experiment could record the average time that users spend reading an article. These two metrics could help to determine whether the new recommendations are better than the old recommendation system.
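One way to split the users for such an experiment is sketched below. This is only an illustration under stated assumptions: `assign_group`, the experiment label "recs-v2", and the range of user ids are hypothetical, and hashing is just one common way to obtain a stable, roughly even split.

```python
import hashlib

def assign_group(user_id, experiment="recs-v2"):
    """Deterministically map a user to 'control' or 'experiment' (roughly 50/50)."""
    digest = hashlib.sha256(f"{experiment}:{user_id}".encode()).hexdigest()
    return "experiment" if int(digest, 16) % 2 else "control"

# Hypothetical user ids 1..1000.
groups = [assign_group(user_id) for user_id in range(1, 1001)]
print(groups.count("control"), groups.count("experiment"))
```

Because the assignment depends only on the user id and the experiment label, a user stays in the same group on every visit, so their ratings and reading times can be attributed to one recommendation system.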